We identified potentially highly suitable areas for a group of species triggering criteria A1 (threatened species) and B1 (geographically restricted species), to inform the Key Biodiversity Areas (KBAs) initiative in Canada (see KBAs Canada and the KBA standard). To do that, we first fitted habitat suitability models relating presence-only species data sets to predictor variables, using the Maxent algorithm implemented in the ENMeval v.2.0.0 package (Kass et al 2021) in R. Then, we used cluster analysis to identify discrete polygons containing most of the highly suitable areas.
We ran the analysis for a group of 96 species as a proof of concept, to provide recommendations for a full implementation (~1,200 species).
In this report we present the following sections:
Scope of the analysis
This large-scale analysis offers a first approximation for identifying potential areas that might harbor KBAs in Canada, for a large group of species of interest (~1,200 species).
The spatial resolution of this analysis is ~1 km2.
This macro-analysis will be followed by expert reviews and local-scale analyses with finer data sets.
It will also help prioritize areas for further automated KBA assessments.
We used this approach to identify areas not only for species of ‘global interest’ (i.e., triggering the A1 and B1 criteria) but also for some species of ‘national interest’ (i.e., species that could trigger national A1 and B1 KBAs in Canada).
Because the geographic ranges of some species extend across the border, we collected and analyzed data for North America (Canada, the US, and Mexico) as the study area.
The selection of species for this first batch was based on the status of their EBAR maps, as follows:
Species with EBAR maps reviewed by experts (128). We started with this group as a proof of concept (we ran HSMs for 96 species).
Species with EBAR maps partially reviewed by experts (98).
Species with EBAR maps auto-generated (921).
This tool was designed to be updated and replicated (e.g., if new species and observations are added to the list). Scripts were developed in R and interactive outputs in Markdown.
Because this tool was designed to offer relevant information to a wide spectrum of users and decision makers and to facilitate their decision-making, we provide results for each step of the process that may affect model reliability and the usefulness of the final outputs.
The workflow presented below describes the steps we followed to identify potential sites for all species triggering criteria A1 and B1 (see the KBA standard for additional information). We also offer a preliminary analysis of the sensitivity of these areas to climate change scenarios.
We developed a top-down approach that can:
Figures and maps presented here are only for illustration purposes.
1. We started the workflow by asking: where are highly suitable areas located within the geographic range of the species?
To do that, we implemented Habitat Suitability Models (HSMs) to identify highly suitable areas within the geographic range of each species. As a proxy for the geographic range, we used ecosystem-based automated range (EBAR) maps developed by NatureServe Canada to bound each HSM (see more information here). EBAR maps are comprised of jurisdiction-approved Ecoshapes (generally Ecodistricts, Level IV Ecoregions, or similar) that are triggered by species observations. EBAR maps were also subject to a thorough process of review by experts on each of the species of interest.
2. Then, we were interested in knowing whether highly suitable areas were clustered or dispersed across the landscape.
This is our first approximation for identifying discrete areas that contain most of the highly suitable habitat and might harbor KBAs.
3. Finally, we evaluated to what extent highly suitable areas might change in the future under various climate change scenarios.
This additional evaluation will provide decision-makers with a sense of the potential threats species face under different climate change scenarios.
We provide maps identifying where those changes (‘losing’ or ‘gaining’ suitable areas) might occur within the geographic range of each species.
In this section we present a summary of the general findings from running HSMs for a sample of 96 species triggering criteria A1 (threatened species) and B1 (geographically restricted species) to inform the Key Biodiversity Areas (KBAs) initiative in Canada (see Figure below).
We also want to highlight some of the main challenges we encountered when running multiple HSMs.
Let’s start by describing the outputs we obtained from this process.
1. Unable to run HSMs
There is a group of species for which we were unable to run HSMs. We identified three cases:
Computer performance limitations (n=34; 35%): we used a PC with 16 GB of RAM and ran HSMs in parallel, using the corresponding ENMeval package option. This was not enough for some species with a broad geographic extent; the computer crashed after 4-5 hours. We recommend moving the analysis for this group of species to a High Performance Computing (HPC) cluster. It is likely that this group of species will generate reliable models, because they have a good number of observations.
Not enough observations (n=20; 21%): HSMs do not run for species with fewer than 3 observations (at least using the ENMeval package and the Maxent algorithm). We suggest using alternative sources to inform the KBA initiative, such as expert knowledge, current observations, and/or critical habitat.
Spatial issues between observations and EBARs (n=2; 2%): we were unable to run models for two species. It seems that the observations and EBARs are in different coordinate systems, so they do not overlap. This is something to review with NatureServe.
2. Species with HSMs predictions
For this group of species we were able to run HSMs, and we classified the models based on one performance metric, the omission rate (OR), as follows:
The omission rate (OR) indicates the “fraction of the test localities that fall into pixels not predicted as suitable for the species. A low omission rate is a necessary (but not sufficient) condition for a good model.” (Phillips et al., 2006). No thresholding rule has yet been developed to determine an optimal omission rate, so we suggest this provisional relative scale (is the model potentially useful?):
There is high concern; OR > 0.50 in model predictions (n=13; 14%). Most of these species had 3 observations; however, some species with 7-10 observations performed similarly poorly. We recommend letting the algorithm run the model (even with 3 observations) and using one of the performance metrics (e.g., OR) to decide model usefulness. We suggest using alternative sources to inform the KBA initiative, such as expert knowledge, current observations, and/or critical habitat.
There is some concern; OR between 0.25 and 0.50 in model predictions (n=7; 7%). This is a group of species with a low number of observations (6-13) and intermediate OR values. We recommend also treating these species as a ‘concern in model predictions’ and suggest using alternative sources to inform the KBA initiative, such as expert knowledge, current observations, and/or critical habitat.
Model predictions can be informative; OR <= 0.25 (n=20; 21%). This is a group of species with a low to high number of observations (8-47) and low OR values. We believe these models can offer relevant information to inform the KBA initiative. We expect that running the group of species with computer performance limitations (n=34; 35%) on an HPC will increase the percentage of informative models to 40-50%, because this group of species has a large number of observations. If true, we could expect at least 500 or so informative HSMs for the entire group of ~1,200 species.
List of species triggering A1 (threatened species) and B1 (geographically restricted species) criteria defined in the KBA standard. You can search for species using common or scientific names (in alphabetical order).
Colors refer to the outcome we obtained after implementing the HSMs. For instance:
Model predictions can be informative.
There is some concern in model predictions.
High concern in model predictions.
Unable to run model predictions: fewer than 3 observations.
Unable to run models (not enough memory): the HSM analysis must be moved to a High Performance Computing (HPC) cluster.
Unable to run model predictions: observations fall outside EBARs (possibly a different coordinate system).
Click on species link (if available) to see results.
In this section we offer a complete description of the methods, tools, data sets, and analyses we applied in each step of the process.
Figures and maps presented here are only for illustration purposes.
1. HABITAT SUITABILITY MODELS (HSMs)
Introduction
We implemented Habitat Suitability Models (HSMs) to identify highly suitable areas within the species’ geographic ranges (using EBARs as a proxy; see description below) and to evaluate potential changes in these areas due to climate change.
HSMs are ecological tools that relate species observations (presence or presence-absence) and environmental (biotic or abiotic) predictor variables (e.g., climate, land cover, topography) to predict suitable habitat across a landscape or a region (Elith and Leathwick 2009).
We followed, as closely as we could, the standard protocols for reporting HSMs proposed by Zurell et al 2020.
Model Objective
This study had two main modelling purposes:
Identifying highly suitable areas for all species of interest (i.e., objective = ‘Mapping’ sensu Araújo et al 2019); and,
Evaluating changes in habitat suitability under future climate change scenarios (i.e., objective = ‘Transferring’ sensu Araújo et al 2019).
Data type
Observations
We used presence-only data from NatureServe Canada that includes multiple sources and different types of features.
Data sharing agreements
Due to data sharing agreements we will not display in this tool any features representing occurrences used in this analysis.
Similarly, we also excluded the following data from HSM modelling:
Unobscured versions of iNaturalist obscured records.
Canadian Wildlife Service records.
Records for British Columbia species subject to persecution and harm.
Our coarse analysis and outputs (grid cell size ~1 km2) comply with data providers’ requirements for generalizing data outputs; that is, cells must be generalized to 1,000 m in any external products.
Data providers
Species Geographic range maps
We used ecosystem-based automated range (EBAR) maps developed by NatureServe Canada to bound each HSM (see more information here). EBAR maps are comprised of jurisdiction-approved Ecoshapes (generally Ecodistricts, Level IV Ecoregions, or similar) that are triggered by species observations. EBAR maps were also subject to a thorough process of review by experts on each of the species of interest.
As of August 2021, this is the list and status of EBARs for species triggering the A1 and B1 criteria.
Species with EBAR maps reviewed by experts (128). We are going to start with this group of species as a proof of concept.
Species with EBAR maps partially reviewed by experts (98).
Species with EBAR maps auto-generated (921).
Predictors
We used raster data sets of environmental predictors available online at similar scales and without spatial gaps across North America.
Spatial resolution
HSM outputs were generated at ~1 km2 spatial resolution. Note that all maps were projected to North American Albers Equal Area Conic to preserve areas.
Taxon/species level
Single species models.
Algorithm
We implemented Habitat Suitability Models (HSMs) to predict suitable habitat across the species’ geographic distributions (Elith and Leathwick 2009). HSMs are ecological tools that relate species observations (presence or presence-absence) and environmental (biotic or abiotic) predictor variables (e.g., climate, land cover, topography). These empirical models rely on correlative relationships between species occurrences and environmental variables (Guisan and Zimmermann 2000).
Regression models (e.g., generalized linear models, GLMs, and generalized additive models, GAMs) or ensemble models (e.g., random forests and boosted regressions) have been used for presence-absence datasets; however, such survey data are rare. Presence-only datasets are more frequent in ecology, so alternative methods have been proposed (Wisz et al. 2008). One of these algorithms is the maximum entropy modeling of species geographic distributions algorithm (MaxEnt; see Phillips et al. 2006), with a ‘maxent’ function developed in the ‘dismo’ package in R.
MaxEnt is an algorithm that has been reported to have high predictive performance compared with other high-performing methods (Hernandez et al. 2006). One of its advantages is that Maxent uses presence-only data and copes well with sparse and irregularly sampled data, circumventing the lack of true presence and absence observations in field surveys (Wisz et al. 2008; Hernandez et al. 2006). Maxent is also “preferable when modelling species that are narrowly distributed and from which not many record locations are available” (Aguirre-Gutierrez et al. 2013; Hernandez et al. 2006). In addition, the habitat suitability output is a continuous map, allowing for the distinction between highly and less suitable areas for the species (Phillips et al. 2006).
The principle, from Phillips et al (2006):
“When approximating an unknown probability distribution, the question arises, what is the best approximation? E.T. Jaynes gave a general answer to this question: the best approach is to ensure that the approximation satisfies any constraints on the unknown distribution that we are aware of, and that subject to those constraints, the distribution should have maximum entropy (Jaynes, 1957). This is known as the maximum-entropy principle.”
For additional information on Maxent see Elith et al. 2013, Merow et al. 2013, Phillips et al. 2006, and Phillips et al. 2017.
We used the ENMeval v.2.0.0 package (Kass et al 2021; see the ENMeval vignette here) to implement the Maxent algorithm. ENMeval provides functionality for model tuning, methods for partitioning occurrence data, and multiple performance metrics (Kass et al 2021).
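A hedged sketch of what such an ENMeval 2.0 call might look like (not the exact production script): `occs_df`, `env_stack`, and `bg_df` are hypothetical placeholder objects prepared beforehand, and the partition method shown is an assumption; the FC/RM grid mirrors the most complex candidate set described in this report (5 FCs × 3 RMs = 15 models).

```r
# Illustrative ENMeval 2.0 call (a sketch, not the production script).
# occs_df: data frame of occurrence coordinates; env_stack: raster stack of
# predictors; bg_df: background points. All three are hypothetical placeholders.
library(ENMeval)

res <- ENMevaluate(
  occs = occs_df, envs = env_stack, bg = bg_df,
  algorithm = "maxent.jar",                    # or "maxnet"
  partitions = "randomkfold",                  # partition method assumed here
  tune.args = list(fc = c("Q", "H", "LQ", "LQP", "QPT"),
                   rm = c(0.05, 0.5, 1))       # 5 FCs x 3 RMs = 15 models
)

res@results  # performance metrics per model (e.g., or.10p.avg, auc.val.avg, AICc)
```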
Extent of the analysis
We used the high-resolution map of North America from the Global Self-consistent, Hierarchical, High-resolution Geography Database (GSHHG):
We considered the entire geographic range of all species, including data sets for North America for those species whose ranges extend across the Canadian border (see an example below).
Access to data sets: NatureServe Canada data, August/2020- October/2021.
a. Type of data
We used data sets in different formats.
Points: all point data.
Polygons: element occurrences (EOs) and critical habitat.
Element occurrence (EO): EOs represent an area of land and/or water in which a species or natural community is, or was, present (see NatureServe for details).
Critical habitat: “Critical habitat is defined under section 2 of SARA as ‘the habitat that is necessary for the survival or recovery of a listed wildlife species and that is identified as the species’ critical habitat in the recovery strategy or in an action plan for the species’. Section 49(1)(a) of SARA requires that a species’ Recovery Strategy/Action Plan include an identification of the species’ critical habitat to the extent possible, based on the best available information, including information provided by the Committee on the Status of Endangered Wildlife in Canada (COSEWIC)” (see Species at Risk Act, SARA).
Lines: transect surveys.
b. Converting polygons and lines to points
HSM algorithms require observations as point features, so polygons and lines were converted to raster using a raster template at ~50 m spatial resolution. For lines, we first applied a 25 m buffer to convert them to polygons. Then, we converted these grid cells to points, using a ‘centroid’ option within each ~1 km2 grid cell. This procedure partially corrects for sampling bias (Fourcade et al 2014).
Finally, we joined all these point sources where available.
c. Spatial filtering
To avoid spatial sampling issues and reduce spatial autocorrelation in our HSM estimates, we thinned the occurrence data set of each species using a thinning algorithm (implemented in the spThin package by Aiello-Lammens et al 2015) with a distance of 5 km (we tested distances of 1, 3, and 5 km). Spatial filtering can also improve model performance (Boria et al 2014).
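To illustrate what distance-based thinning does, here is a minimal base-R sketch (a hypothetical helper, not spThin itself): greedily keep points that are at least `min_dist` apart. For simplicity it uses Euclidean distance on projected coordinates, whereas spThin works on geographic coordinates with great-circle distances.

```r
# Greedy distance-based thinning sketch (illustrative only; spThin is used in
# the actual analysis). xy: two-column matrix of projected coordinates.
thin_points <- function(xy, min_dist) {
  keep <- integer(0)
  for (i in seq_len(nrow(xy))) {
    # keep point i only if it is at least min_dist from every kept point
    if (length(keep) == 0 ||
        all(sqrt(rowSums((xy[keep, , drop = FALSE] -
                          matrix(xy[i, ], length(keep), 2, byrow = TRUE))^2))
            >= min_dist)) {
      keep <- c(keep, i)
    }
  }
  xy[keep, , drop = FALSE]
}

pts <- matrix(c(0, 0,  1, 0,  10, 0,  10, 1), ncol = 2, byrow = TRUE)
thin_points(pts, min_dist = 5)  # keeps (0,0) and (10,0)
```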
We included four groups of predictors with data available across North America: topographic heterogeneity, dynamic habitat indices, climate and distance to water.
a. Topographic heterogeneity (TH)
We used two indices of topographic heterogeneity developed by Amatulli et al 2018. Access date: September 17, 2020, on this website.
Derived topographic continuous variables (see description here):
| DATASET | AGGREGATION | SOURCES | RESOLUTION | Date downloaded |
|---|---|---|---|---|
| Vector Ruggedness Measure (VRM) | Mean | GMTED2010 | 1km | September 17, 2020 |
| Roughness | Mean | GMTED2010 | 1km | September 17, 2020 |
| Eastness | Mean | GMTED2010 | 1km | August 2, 2021 |
| Northness | Mean | GMTED2010 | 1km | August 2, 2021 |
| Slope | Mean | GMTED2010 | 1km | August 2, 2021 |
GMTED2010: Global Multi-resolution Terrain Elevation Data 2010 dataset (250m resolution).
Vector Ruggedness Measure (VRM): it “quantifies terrain ruggedness by measuring the dispersion of vectors orthogonal to the terrain surface. Slope and aspect are decomposed into 3-dimensional vector components (in the x, y, and z directions) using standard trigonometric operators, and by calculating the resultant vector magnitude within a user-specified moving window size (in this study 3 × 3).” (Amatulli et al 2018).
Roughness: “Roughness is expressed as the largest inter-cell difference of a focal cell and its 8 surrounding cells.” (Amatulli et al 2018).
Northness: “a northness value close to 1 corresponds to a northern exposition on a vertical slope (i.e., a slope exposed to very low amount of solar radiation), while a value close to -1 corresponds to a very steep southern slope, exposed to a high amount of solar radiation.”
a.1. Why is topographic heterogeneity important?
TH is expected to increase environmental gradients and create multiple ecological niche opportunities. TH can influence the sub-regional variation of abiotic factors (micro/macroclimates, soil composition, hydrological systems) and biotic factors (number of species, population dynamics, and species movements, among others; Amatulli et al 2018). In addition, TH can provide shelter and refuge for species and/or small populations during periods of adverse environmental conditions and/or drastic climate change. It has been suggested that heterogeneous environments, by offering more potential niches, can promote the coexistence of species and therefore might affect the number of species found at a site (habitat heterogeneity hypothesis; Pianka 1966, Cramer and Willig 2004, and empirical evidence in Stein et al 2014), among other factors (e.g., Allouche et al 2012).
Studies have also suggested that TH, in concert with other environmental variables, can affect geographic range size (Janzen 1967 and McCain 2009), species turnover (Buckley and Jetz 2008 and Fjeldså et al 2012), endemism (Steinbauer et al 2016), and life-history characteristics (Bears et al 2009).
Slope aspect can significantly influence microclimate and hydrological processes, which might result in distinct vegetation types, species turnover, and species habitats (Yang et al 2020).
b. Dynamic Habitat Index (DHI)
The Dynamic Habitat Index (DHI) is an integrated metric of vegetation production from satellite imagery that measures the fraction of photosynthetically active radiation (fPAR) intercepted by vegetation (Coops et al 2008). The DHI integrates three measures of vegetation productivity: a) cumulative annual productivity, b) the minimum level of vegetation cover, and c) the degree of seasonality (Radeloff et al 2019). These indices are strongly correlated at the global scale; however, local differences emerge among them when considering the geographic ranges of the species.
Access date: November 2019.
Spatial resolution: 0.0083 decimal degrees (~1 km).
Datasets downloaded from: DHI, from the Silvis Lab.
For each index we stacked all years (2003-2015) and calculated the mean.
b1. Cumulative annual productivity (CAP, or total greenness, band_1)
It has been shown that CAP accurately predicts species range size (Nilsen et al 2005), species richness (Hobi et al 2017 and Radeloff et al 2019), beta diversity (Andrew et al 2012), species abundances (Michaud et al 2014), and productivity across various cover types (Coops et al 2008). It has also been used in conservation planning (Powers et al 2012).
b2. Minimum Annual Productivity (MAP, band_2)
“The continual provision of food and habitat resources throughout the year is of particular interest to wildlife conservationists, as changes in the amount and quality of available vegetative cover influences behavior of many herbivorous species, and ultimately, the carnivorous species which prey upon them (Schwartz et al. 2006). Locations without significant snow cover in autumn will maintain a cover of green biomass into winter providing accessible food resources and habitat.” (Coops et al 2008).
b3. Variation of Annual Productivity (band_3)
“Sites with low seasonality are indicative of areas that have consistent vegetation production throughout the year such as evergreen forests” (Coops et al 2008).
c. Climate
Nineteen (19) bioclimatic variables from WorldClim (Hijmans et al 2005) were used in this analysis for both ‘historical’ and ‘projected’ climate.
c.1. Historical climate
c.2. Projected climate
For each projected bioclimatic variable (n=19), we downloaded data for three Global Circulation Models (GCMs) and two Representative Concentration Pathways (RCPs), as described below. Then, we averaged all GCMs within each RCP for each variable; this average was used to implement HSMs for future climate scenarios. In addition, as a measure of uncertainty, we calculated the coefficient of variation (cv, a measure of dispersion among GCMs) for each bioclimatic variable. We expect low cv values across space for all bioclimatic variables, which would indicate low uncertainty and high reliability of the data sets.
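The per-pixel GCM summary can be sketched in base R as follows (toy numbers, with a plain matrix standing in for the raster layers): rows are pixels, columns are GCMs; we average across GCMs and compute the coefficient of variation (cv = sd / mean) as the uncertainty measure.

```r
# Toy per-pixel summary of three GCMs for one bioclim variable and one RCP.
# rows = pixels, columns = GCMs (values are illustrative).
gcm_vals <- cbind(gcm1 = c(10, 20, 30),
                  gcm2 = c(12, 18, 33),
                  gcm3 = c(11, 22, 27))

gcm_mean <- rowMeans(gcm_vals)              # ensemble mean used in the HSMs
gcm_cv   <- apply(gcm_vals, 1, sd) / gcm_mean  # uncertainty: sd / mean per pixel

gcm_mean  # 11 20 30
gcm_cv    # low values indicate good agreement among GCMs
```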
Collaborative framework:
We used the Coupled Model Intercomparison Project Phase 5 (CMIP5), a collaborative framework aimed at improving knowledge of climate change (CMIP). We will update this as soon as the new CMIP6 datasets are available online.
Global Circulation Models (GCMs):
GCMs are mathematical models of the general circulation of the atmosphere that allow predictions of future climate change scenarios (GCMs). We screened and selected GCMs based on previously published evaluations (e.g., Sheffield et al 2013) and on the availability of information for all bioclimatic variables. These are the selected GCMs:
Representative Concentration Pathways (RCPs):
RCPs are greenhouse gas concentration (not emissions) trajectories adopted by the IPCC, describing different climate futures (as shown in the graph below; see data here).
We used two contrasting RCPs for the analysis:
c.3. Why is climate important?
Climate has been shown to be an important driver of species distributions (e.g., Araujo et al 2005, Davies et al 2011, and Vásquez & Currie 2014), and climate change might threaten conservation areas and biodiversity (Araujo et al 2011 and Newbold 2018).
d. Water
We recognized the importance of freshwater bodies for species presence, so we constructed two metrics using lakes from HydroLakes v1.
d1. Distance to water
We calculated the distance from every grid cell in the study region to the nearest grid cell representing a lake.
d2. Percentage of grid cells (100Km2) that represent lakes within a 1Km2 grid cell
We calculated the percentage of grid cells (100 km2) that represent lakes within a 1 km2 grid cell.
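The percentage-of-lake-cells metric can be illustrated with a base-R sketch, using a plain binary matrix in place of the lake raster (with the terra package, the equivalent would be something like `aggregate(lakes, fact = ..., fun = mean) * 100`; the function below is hypothetical):

```r
# Toy binary lake raster (1 = lake cell) as a plain matrix.
lakes <- matrix(c(1, 0, 0, 0,
                  1, 1, 0, 0,
                  0, 0, 0, 1,
                  0, 0, 1, 1), nrow = 4, byrow = TRUE)

# Percentage of fine lake cells within each coarse block of fact x fact cells.
pct_lakes <- function(m, fact) {
  nr <- nrow(m) / fact; nc <- ncol(m) / fact
  out <- matrix(NA_real_, nr, nc)
  for (i in seq_len(nr)) for (j in seq_len(nc)) {
    block <- m[((i - 1) * fact + 1):(i * fact), ((j - 1) * fact + 1):(j * fact)]
    out[i, j] <- 100 * mean(block)  # fraction of lake cells, as a percentage
  }
  out
}

pct_lakes(lakes, fact = 2)  # 2x2 matrix of percentages
```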
1.3.1. Multicollinearity
We used a correlation coefficient threshold of 0.7 (Dormann et al 2012) to select a set of uncorrelated variables, using the removeCollinearity function from the virtualspecies package in R.
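The idea behind this filtering step can be shown with a small base-R analogue (a hypothetical helper, not the removeCollinearity implementation): greedily keep variables whose pairwise |correlation| with already-kept variables stays below the 0.7 cutoff.

```r
# Minimal base-R analogue of correlation-based variable filtering.
# Keeps a variable only if its |correlation| with every kept variable < cutoff.
select_uncorrelated <- function(df, cutoff = 0.7) {
  kept <- character(0)
  for (v in names(df)) {
    if (length(kept) == 0 ||
        all(abs(cor(df[[v]], df[kept])) < cutoff)) {
      kept <- c(kept, v)
    }
  }
  kept
}

set.seed(1)
x1 <- rnorm(100)
dat <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(100, sd = 0.01),  # near-duplicate of x1
                  x3 = rnorm(100))                  # independent variable

select_uncorrelated(dat)  # drops x2, keeps "x1" and "x3"
```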
1.3.2. Evaluating multiple models
We used the ENMeval package (Muscarella et al 2014) to evaluate a series of candidate models with various settings (see the ENMeval vignette here), including multiple feature classes (FCs; adding more FCs enables more complexity) and minimizing overfitting via regularization multipliers (RMs; penalizing “the inclusion of additional parameters that result in little or no ‘gain’ to the model”, Muscarella et al 2014). The ENMeval package implements the MaxEnt algorithm.
1.3.3. Number of observations and model performance
Model building (e.g., type of features, partition method; see next section) and model performance are mediated by the number of species occurrences available.
Although Maxent has been reported to perform well among various algorithms (Hernandez et al. 2006) and is less sensitive to sample size, there is also some concern that this algorithm (and others) does not predict “consistently well with small sample size (n < 30) and this should encourage highly conservative use of predictions based on small sample size and restrict their use to exploratory modelling” (Wisz et al 2008).
Some studies have reported that the minimum number of occurrences required to obtain a good model fit ranges from 14 for range-restricted species to 25 for widespread species (van Proosdij et al 2015). Others, based on empirical studies, have recommended 13 as the minimum sample size (Randal, personal communication).
There is also a general rule of thumb that the minimum sample size (number of occurrences from presence-only data sets) should be 10 times larger than the number of predictor variables to reduce model overfitting (Harrell et al 1996). However, for species with a small number of occurrences, maintaining this 1:10 ratio is quite challenging (e.g., a species with 30 occurrences would allow only 3 predictors). A promising area of study is the implementation of ensembles of small bivariate models (Breiner et al 2015), which can circumvent the limitation of a small number of occurrences. Still, standard Maxent methods seem to perform well enough for this exploratory analysis.
Here, we allowed the algorithm to run the analysis and assessed model usefulness based on performance metrics.
1.3.4. Model settings
a. Optimization of model parameters
Because it is computationally prohibitive to include all potential feature combinations within a model, we constructed models with as much complexity as we could (i.e., combining FCs and RMs). For instance, in complex models we combined 5 FCs and 3 RMs for a total of 15 candidate models.
d1. Feature types
“The idea of Maxent is to estimate a target probability distribution by finding the probability distribution of maximum entropy (i.e., that is most spread out, or closest to uniform), subject to a set of constraints that represent our incomplete information about the target distribution. The information available about the target distribution often presents itself as a set of real-valued variables, called “features”, and the constraints are that the expected value of each feature should match its empirical average (average value for a set of sample points taken from the target distribution). When Maxent is applied to presence-only species distribution modeling, the pixels of the study area make up the space on which the Maxent probability distribution is defined, pixels with known species occurrence records constitute the sample points, and the features are climatic variables, elevation, soil category, vegetation type or other environmental variables, and functions thereof." (Phillips et al 2006).
In summary, “features” constrain the computed probability distribution and represent ecological assumptions (i.e., predictors that constrain the geographical distribution of the species; Phillips et al 2006).
The available feature types, with descriptions from Li et al (2020), are:
Linear (L): “Linear features constrain the output distribution for each species to have the same expectation of each of the continuous climate variables as the sample locations for that species. A linear feature is simply one of the continuous climate variables”.
Quadratic (Q):“Quadratic features (when used together with linear features) constrain the output distribution to have the same expectation and variance of the climate variables as the samples. A quadratic feature is the square of one of the continuous climate variables”.
Product (P): “A product feature is the product of two continuous climate variables; when used with linear and quadratic features, product features constrain the output distributions to have the same covariance for each pair of climate variables as the samples”.
Threshold (T): "A threshold feature is derived from a continuous climate variable. For a threshold value v, the threshold feature is binary (taking values 0 and 1) and is 1 when the variable has a value greater than v".
d2. Feature types and observations
We selected the feature types according to the number of observations (Merow et al 2013), as follows:
| Features_type | Number of Observations |
|---|---|
| None | 1 |
| Linear(L) | >2 |
| Quadratic(Q) | >10 |
| Hinge(H) | |
| Threshold(T) | >80 |
| Product(P) | >80 |
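The table above can be encoded as a small helper (hypothetical, not part of the original scripts; the hinge threshold was left unspecified in the table, so hinge features are omitted here):

```r
# Hypothetical helper encoding the observation thresholds from the table above
# (following Merow et al 2013 as used in this report). Hinge features are
# omitted because their threshold was not specified in the table.
features_for_n <- function(n_obs) {
  fc <- character(0)
  if (n_obs > 2)  fc <- c(fc, "L")        # linear
  if (n_obs > 10) fc <- c(fc, "Q")        # quadratic
  if (n_obs > 80) fc <- c(fc, "T", "P")   # threshold and product
  paste(fc, collapse = "")                # "" corresponds to "None"
}

features_for_n(5)    # "L"
features_for_n(15)   # "LQ"
features_for_n(100)  # "LQTP"
```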
d3. Feature combinations (FCs)
Note1: “Linear features are special cases of hinge features, so it is redundant to use L and H features simultaneously” (Phillips and Dudík 2008).
Remember that the number of parameters retained in the final best model is a function of the number of features created in model fitting, which in turn depends on the number of observations and predictors. For instance:
Linear features: one per predictor.
Product features: one per pair of predictors, i.e., predictors! / (2!(predictors − 2)!).
Feature combinations can lead to a large number of features created in model fitting, and therefore more parameters in the model. “For example, if there are 100 presence observations the number of piecewise features is given by: 3 (types of piecewise functions) × 99 (pairs of data point) × 19 (predictors) = 5643. This collection of features is explored by MaxEnt, and the most useful features are extracted.” (Merow et al 2013).
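The worked example from Merow et al (2013) is easy to check:

```r
# Reproducing the feature-count arithmetic from Merow et al (2013):
# 100 presence observations and 19 predictors.
n_types      <- 3   # types of piecewise functions
n_pairs      <- 99  # pairs of data points (observations - 1)
n_predictors <- 19

n_types * n_pairs * n_predictors  # 5643 candidate piecewise features
```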
d4. Regularization multiplier (RM)
The RM prevents model over-fitting and smooths the distribution, making it more regular.
“A smaller RM value will result in a more localized output distribution that fits the given occurrence records better, but is more prone to overfitting. Overfitted models are poorly transferable to novel environments and, thus, not appropriate to project distribution changes under environmental change. A larger RM value will produce a prediction that can be applied more widely.” (Li et al 2020).
“More complex feature classes will require more regularization to yield accurate predictions.” Phillips and Dudík (2008)
We used regularization multiplier (RM) values proposed by Phillips and Dudík (2008), based on the number of observations, and evaluated these RMs.
d5. Data partitioning method
d6. Model complexity
Note that it is computationally prohibitive to include all potential feature combinations within an ENMeval evaluation (i.e., more complexity). So, we tried to include and evaluate as much complexity as we could. As a result, the most complex set of candidate models comprised 15 models, combining 5 FCs (Q, H, LQ, LQP, QPT) and 3 RMs (0.05, 0.5, 1).
d7. The effect of number of background points and sampling bias
We selected the best model within each of these six groups of models based on the lowest AICc value (i.e., six best models). Then, from this list of six models, we selected the ‘best model’ based on the highest AUC value. Further analyses were based on this ‘best model’.
Background points (bg_points).
Minimizing sampling bias. Sampling bias is the result of different sampling efforts across environmental space, and it can lead to models that predict sampling effort rather than habitat suitability for the species (Moua et al 2020). Phillips et al (2009) proposed that choosing background data with the same bias as presence data can minimize sampling bias. Here we constructed a sampling bias layer from the observation points (i.e., a two-dimensional kernel density estimate, using the kde2d function from the MASS package).
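A minimal sketch of this bias-layer step, with simulated coordinates standing in for the real observation data:

```r
library(MASS)  # ships with R; kde2d() gives a 2-D kernel density estimate

set.seed(1)
# Hypothetical observation coordinates; real data come from the EBAR points
obs <- data.frame(x = rnorm(200, -100, 5), y = rnorm(200, 55, 3))

# Bias surface: 2-D kernel density of the observations on a 100 x 100 grid
bias <- kde2d(obs$x, obs$y, n = 100)

# Sample background points with probability proportional to the bias surface,
# so the background carries the same sampling bias as the presences
grid <- expand.grid(x = bias$x, y = bias$y)
prob <- as.vector(bias$z) / sum(bias$z)
bg   <- grid[sample(nrow(grid), 1000, replace = TRUE, prob = prob), ]
```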
a. Approach
We implemented a sequential method to select the best model using two performance metrics. This approach uses cross-validation results, selecting models with the lowest average test omission rate and, to break ties, the highest average validation AUC (Kass et al 2020; Radosavljevic & Anderson 2014). Further analyses were based on this ‘best model’.
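The sequential rule can be sketched against a hypothetical ENMeval results table (the column names or.10p.avg and auc.val.avg follow ENMeval's output; the values here are made up):

```r
# Hypothetical ENMeval cross-validation summary
res <- data.frame(
  fc          = c("LQ", "LQP", "QPT", "H"),
  rm          = c(0.5, 1, 0.05, 0.5),
  or.10p.avg  = c(0.12, 0.10, 0.10, 0.30),
  auc.val.avg = c(0.81, 0.78, 0.84, 0.70)
)

# Step 1: keep models with the lowest average test omission rate
cand <- res[res$or.10p.avg == min(res$or.10p.avg), ]

# Step 2: break ties with the highest average validation AUC
best <- cand[which.max(cand$auc.val.avg), ]
best$fc  # "QPT"
```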
Omission rate (OR) indicates the “fraction of the test localities that fall into pixels not predicted as suitable for the species. A low omission rate is a necessary (but not sufficient) condition for a good model.” (Phillips et al., 2006). No thresholding rule has yet been developed to determine the optimal omission rate, so we suggest this provisional relative scale (is the model potentially useful?): 0–0.25, the model can be informative; 0.25–0.50, there is some concern about model predictions; > 0.50, there is high concern about model predictions. Omission rate values (or.10p.avg).
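The provisional scale can be expressed as a small helper function (or_concern is a hypothetical name for illustration, not part of our scripts):

```r
# Map an average test omission rate to the provisional interpretive scale
or_concern <- function(or) {
  stopifnot(or >= 0, or <= 1)
  if (or <= 0.25)      "informative"
  else if (or <= 0.50) "some concern"
  else                 "high concern"
}

or_concern(0.12)  # "informative"
or_concern(0.60)  # "high concern"
```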
AUC values are “the probability that a random positive instance and a random negative instance are correctly ordered by the classifier.” (Phillips et al., 2006). AUC values above 0.75 are considered potentially useful (Elith 2002; Phillips and Dudík 2008). AUC values (auc.val.avg).
The graph below compares the average test omission rate (y-axis) across multiple regularization values (rm, x-axis). Features (groups in colors) with NA values are not shown in the graph (see summarizing tables below for details).
The graph below compares the average validation AUC (y-axis) across multiple regularization values (rm, x-axis). Features (groups in colors) with NA values are not shown in the graph (see summarizing tables below for details).
Note that AUC values above 0.75 (blue dashed line) are considered potentially useful (Elith 2002; Phillips and Dudík 2008).
b. Variables importance
An example is presented below for variable importance:
c. Best model (parameters)
We used the eval.models function from the ENMeval package to identify the best model.
The table below presents a summary of model parameters for the best model. We used this model for further analyses.
We used the eval.predictions function from ENMeval to create the habitat suitability map from the best model (see figure below).
The map shows a measurement of uncertainty based on the coefficient of variation among 10 HSA maps generated in model fitting.
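A sketch of this uncertainty calculation, with simulated matrices standing in for the 10 HSA maps produced in model fitting:

```r
# Per-cell coefficient of variation (sd / mean) across 10 HSA maps;
# here the maps are simulated 50 x 50 matrices on a common grid
set.seed(42)
maps <- replicate(10, matrix(runif(2500), 50, 50))  # 50 x 50 x 10 array

# Apply over rows and columns: one CV value per cell
cv <- apply(maps, c(1, 2), function(v) sd(v) / mean(v))
dim(cv)  # 50 50
```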
We used R version 3.6.3 [64-bit] for all analyses (R Core Team, 2018) and the following R packages:
‘dismo’ was used to run HSMs.
‘forcats’ was used for reordering factors.
‘ggplot2’ was used to create graphics.
‘raster’ was used to manage raster data sets (Hijmans and van Etten 2014).
‘sf’ was used to manage vector data sets.
‘virtualspecies’ was used to remove collinearity among variables within a raster stack (Leroy et al. 2016).
‘spThin’ was used for spatial filtering (i.e., thinning the data set to reduce the effect of sampling bias and spatial autocorrelation) (Aiello-Lammens et al 2015).
‘ENMeval’ was used to evaluate a series of candidate models with various settings (Muscarella et al 2014; see the ENMeval vignette here).
‘readxl’ was used to read Excel tables.
‘arcgisbinding’ was used to transfer data from ArcGIS (NatureServe dataset in ArcGIS Pro) to R.
‘WorldClimTiles’ (devtools::install_github(“kapitzas/WorldClimTiles”)) was used to download and merge bioclim tiles.
‘MASS’ (kde2d) was used to create a density map based on observations.
We used kernel analysis (three radii) to evaluate whether high suitable areas (HSA = 0.75 to 1.00) were clumped, allowing the identification of discrete potential sites to harbor KBAs.
We used a kernel density estimator, transforming high suitable areas into points and computing an isotropic kernel intensity estimate of the point pattern. We evaluated various radii (3 km, 4 km and 5 km) to calculate the density of points within each radius window.
We tested here the highest percentile of the HSA map (HSA = 0.75 to 1.00).
Preliminary analysis showed that the 5 km window differentiated clusters within the landscape better.
R packages
‘spatstat’ was used for kernel density analysis.
‘stars’ was used to convert raster to polygon.
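A base-R sketch of the clustering step above (the analysis itself uses spatstat; MASS::kde2d serves as a stand-in here, and the HSA grid is simulated):

```r
library(MASS)  # kde2d() as a stand-in for spatstat's density estimate

set.seed(7)
# Hypothetical HSA raster as a matrix on a ~1-km grid
hsa <- matrix(runif(100 * 100), 100, 100)

# Keep only the highest percentile of suitability (0.75-1.00)
# and treat those cells as points
idx <- which(hsa >= 0.75, arr.ind = TRUE)
pts <- data.frame(x = idx[, "col"], y = idx[, "row"])  # km coordinates

# Kernel intensity with a 5-km bandwidth window; peaks in dens$z mark
# clumped high suitable areas (candidate discrete sites)
dens <- kde2d(pts$x, pts$y, h = 5, n = 100)
```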
The effects of climate change on species distributions have been broadly documented (e.g., McKenney et al 2011, Pacifici et al 2015), as has the ability of species to track climate change (e.g., Schloss et al 2012). Therefore, high suitable areas identified by HSMs under historical conditions (current climate) may differ in the future.
We compared models using current climate with two projected climate scenarios:
We also calculated the spatial distribution of the major climate changes within the geographic range of each species, using this formula:
HSA change = projected HSA − current HSA.
‘Losing’ (−) suitable areas are shown in red, while blue areas indicate ‘gaining’ suitable areas in the future. Yellow (zero values) indicates no change in the suitable areas.
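The change map can be sketched per cell as follows (simulated grids; the sign convention assumed here makes losses negative and gains positive, matching the map legend):

```r
# HSA change per cell: projected minus current, on simulated 50 x 50 grids
set.seed(3)
current   <- matrix(runif(2500), 50, 50)
projected <- matrix(runif(2500), 50, 50)

change <- projected - current

losing  <- change < 0   # mapped in red
gaining <- change > 0   # mapped in blue; zero (yellow) = no change
```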
Juan Zuloaga thanks Liber-Ero – Biodiversity Conservation for funding.
We would also like to thank the researchers who contributed to this work:
Planning:
Datasets:
HSM reviewers:
Software troubleshooting:
Guillaume Larocque (McGill University) Val Lucet (McGill University) Gonzalo Pinilla (CUNY)